Lecture 4

Visualizing and summarizing data

Prior to this lecture, you should have read chapters 2 and 3 of Regression and Other Stories.

Always plot your data!

All of these data sets have the same average and standard deviation for their x and y variables, and the same correlation between x and y.

Visualizing the distribution of a continuous variable

A histogram is a common and useful way to visualize the distribution of a continuous variable. A histogram is a special bar chart where the categories along the x-axis are bins (i.e. ranges) of values for a variable, and the heigth of the bars indicates the number of observations within each bin.

If you have a data set loaded into R as a data frame, you can quickly create a histogram using the geom_histogram() function within ggplot(). In this example, I have a data frame called crashes and one of the variables in that data frame is called crashes_per_km2. The bins argument indicates the number of bins I want in my histogram.

ggplot(crashes) +
  geom_histogram(aes(x = crashes_per_km2),
                 color = "black",
                 fill = "pink",
                 bins = 30) +
  theme_minimal()

You can also create a histogram in Excel. You need to define a set of threshold values for your bins, then use the “Histogram” tool in the “Data Analysis” toolpack, which you’ll find on the “Data” ribbon if you’ve enabled it.

A jittered (or one-dimensional) scatter plot can be another good way to visualize the distribution of a variable. Making your points very small (size = 0.5) and transparent (alpha = 0.5) can be helpful if you have a lot of points plotted on top of one another (a problem called overplotting).

ggplot(crashes) +
  geom_point(aes(x = crashes_per_km2, y = 0),
             position = "jitter",
             size = 0.5,
             alpha = 0.5) +
  scale_y_continuous(name = "",
                     breaks = c()) +
  theme_minimal() 

Visualizing the relationship between two continuous variables

A more typical use of a scatter plot is to visualize the relationship between two continuous variables. In this example, I want to see the relationship between population density (people per square kilometer) and crash density (crashes per square kilometer). It is standard to have your outcome variable on the y-axis.

ggplot(crashes) +
  geom_point(aes(x = ppl_per_km2, 
                 y = crashes_per_km2),
             size = 0.5,
             alpha = 0.5) +
  theme_minimal() 

It’s straightforward to create a plot like this in Excel as well.

By default, Excel will guess what values you want on your x and y axes, and it sometimes guesses wrong. I recommend clearing out all the data series and adding in the one you want manually to be sure that you’re plotting what you think you’re plotting.

The scatterplot above is for a dataset with 187 observations. Here is a scatterplot for a dataset with over 800,000 observations.

ggplot(commuting) +
  geom_point(aes(x = INCTOT, 
                 y = TRANTIME),
             size = 0.5,
             alpha = 0.5) +
  theme_minimal() 

Overplotting is more of a problem in this case, and a heat map might be more legible. This divides the plot area into a grid, with variation in color representing the number of points within each grid cell.

ggplot(commuting) +
  geom_bin_2d(aes(x = INCTOT, 
                 y = TRANTIME)) +
  scale_fill_viridis_c() +
  theme_minimal() 

You can also create a heat map from a hexagonal grid, which can sometimes make patterns a little more legible.

ggplot(commuting) +
  geom_hex(aes(x = INCTOT, 
                 y = TRANTIME)) +
  scale_fill_viridis_c() +
  theme_minimal() 

There isn’t a simple way to produce this type of heat map in Excel, although you could make something that looks a rectangular gridded heat map by using through a combination of manually defined bins, a pivot table, and conditional cell formatting.

Decribing continuous variables

You can describe a continuous variable in terms of its central tendency and its spread.

Central tendency

The most common measure of central tendency is an average, which is also called a mean.

If you have a data set loaded into R as a data frame, you can calculate the average of one of the variables using the mean() function. In this example, I have a data frame called crashes and one of the variables in that data frame is called crashes_per_km2.

mean(crashes$crashes_per_km2)
## [1] 368.2335

In Excel, you can use the AVERAGE() function.

One problem with using an average value to describe your data is that a single observation can have a big influence on the average if it has very high or very low values (we would describe this as an outlier). A median is an alternative measure of central tendency. Half of the observations in your dataset have values below the median, and half the values in your dataset have values above the median.

You can calculate a median with the median() function in R or the MEDIAN() function in Excel.

Here are are histograms for two different variables with the same average and very different medians.

Spread

A common measure of spread for a continuous variable is its standard deviation.

You can calculate a standard deviation using the sd() function in R, or the STDEV.P() function in Excel.

A smaller standard deviation means the values are less spread out. Here are histograms for two variables that have the same mean, but different standard deviations.

A standard deviation is related to the average difference between the value for an individual observation and the average value, and it is sensitive to outliers. A measure of spread that is less sensitive to outliers is the interquartile range, or the range of values between the 25th percentile (i.e. the value that 25 percent of your observations will be below) and the 75th percentile (i.e the value that 75 percent of your observations will be above).

In R, you can calculate percentile values using the quantile(). Here’s how you would calculate the 25th, 50th, and 75th percentiles (the 50th percentile is the median).

quantile(crashes$crashes_per_km2, probs = c(0.25, 0.5, 0.75))
##      25%      50%      75% 
## 154.8500 324.3568 519.0817

In Excel, you can calculate percentile values using PERCENTILE.INC().

Here are two distributions that are very different in terms of mean and standard deviation, but similar in terms of median and interquartile range.

One more measure of the spread of a variable that is very sensitive to outliers is the range of the data: the difference between the minumum and maximum values.

You can calculate the minumum value of a variable using the min() function in R or the MIN() function in Excel.

You can also quickly get a summary of a variable’s range, interquartile range, mean, and median using the summary() function in R.

summary(crashes$crashes_per_km2)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   34.93  154.85  324.36  368.23  519.08 1906.71

Both standard deviation and range are sensitive to outliers and extreme values, but they don’t measure the same property. Here are two distributions that have the same range, but different standard deviations.

Describing the distribution of a continuous variable

In real life, it’s common for variables to fall into histograms that are shaped kind of like a bell (a bell curve). This is so common, that we call it a normal distribution. It looks something like this, with more values closer the the average value, and fewer values farther from the average. It’s symmetrical in the sense that values are as likely to be below average as above average (so the median will be almost the same as the mean).

The methods we discuss in this class are based on an assumption that your continuous variables have an approximately normal distribution. Sometimes they don’t. This one doesn’t.

When you have a lot of lower values and fewer higher values like this, you can call this a left-skewed distribution (because there are more values on the left side of the histogram than on the right). A left-skewed distribution might indicate a log-normal distribution, where the logarithms of the values approximate a normal distribution better than the values themselves do.

You might also have a bi-modal distribution like the one below. Where the histogram shows two distinct humps. This might suggest you have two different populations (or categories) in your data, and you might need to account for that in your analysis somehow.

Summarizing a categorical variable

Variation in a categorical variable is somewhat simpler. You’ve more or less described it fully when you’ve reported the proportion of your sample in each category.

If I have a data frame in R called commuting, with a variable called mode to indicate the typical mode of transportation to work, I can get the number of people who report using each mode like this:

commuting %>%
  group_by(mode) %>%
  summarise(number = n())
## # A tibble: 8 × 2
##   mode       number
##   <chr>       <int>
## 1 Bike         8390
## 2 Car        715885
## 3 Ferry         656
## 4 Motorcycle   3011
## 5 Other        8555
## 6 Taxi         1460
## 7 Transit     42950
## 8 Walk        23893

And I can show that as the proportions by dividing by the total sample size (which would be the length of any of the data frame’s columns).

commuting %>%
  group_by(mode) %>%
  summarise(number = n()) %>%
  mutate(proportion = number / length(commuting$mode)) 
## # A tibble: 8 × 3
##   mode       number proportion
##   <chr>       <int>      <dbl>
## 1 Bike         8390   0.0104  
## 2 Car        715885   0.890   
## 3 Ferry         656   0.000815
## 4 Motorcycle   3011   0.00374 
## 5 Other        8555   0.0106  
## 6 Taxi         1460   0.00181 
## 7 Transit     42950   0.0534  
## 8 Walk        23893   0.0297

I can do something similar in Excel using a pivot table:

Visualizing proportions

Bar charts

A bar chart is a reasonable way to visualize proportions. You can create one in R using geom_bar().

commuting %>%
  group_by(mode) %>%
  summarise(number = n()) %>%
  mutate(proportion = number / length(commuting$mode)) %>%
  
  ggplot() +
    geom_bar(aes(x = mode, y = proportion),
             stat = "identity") +
    scale_y_continuous(breaks = breaks <- seq(0, 1, by = 0.1),
                       labels = paste0(breaks * 100, "%")) +
    theme_minimal()  

And you can do something similar in Excel.

When you have some very small categories, a tree map can be more legible than a bar chart.

library(treemapify)

commuting %>%
  group_by(mode) %>%
  summarise(number = n()) %>%
  mutate(proportion = number / length(commuting$mode)) %>%
  
  ggplot() +
    geom_treemap(aes(area = proportion, fill = mode)) +
    geom_treemap_text(aes(area = proportion, label = mode),
                      color = "white") +
    scale_fill_brewer(type = "qual", 
                      guide = "none") # leave off the legend

And you can do something similar in Excel.